fix(en): improve normalization after voxpopuli results by egenthon-cmd · Pull Request #27 · gladiaio/normalization

egenthon-cmd · 2026-06-30T17:15:15Z

Align ref/hyp WER gaps for corpus aliases, possessives, article refs, elided percentages, and hundred-scale numbers misheard as trailing zero.

What does this PR do?

Align ref/hyp WER gaps for corpus aliases, possessives, article refs,
elided percentages, and hundred-scale numbers misheard as trailing zero.

Type of change

Checklist

Only fill in the section(s) that match your change — delete the rest.

Edit existing language

New/changed word substitutions go in replacements.py, not inline in operators.py
If you changed a config field that can be None: the step reading it still handles None gracefully
Unit tests updated or added
E2e CSV updated if the expected output changed

How was this tested?

uv run pytest tests/

Summary by CodeRabbit

New Features
- Improved English text normalization for numbers, percentages, parliamentary references, and common word aliases.
- Better cleanup of possessive forms and spaced apostrophes after normalization.
Bug Fixes
- More reliably preserves intended number formatting in tricky compound phrases.
- Tightened space-removal behavior between digits to avoid over-collapsing in edge cases.
Documentation
- Updated normalization guidance to reflect the expanded apostrophe and possessive handling.

coderabbitai · 2026-06-30T17:15:30Z

Review details

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22ddad75-aa52-4b4a-ac00-8e6b622500ce

📥 Commits

Reviewing files that changed from the base of the PR and between 387f96a and 3d97f82.

📒 Files selected for processing (7)

docs/steps.md
normalization/languages/english/operators.py
normalization/presets/gladia-3.yaml
normalization/steps/text/remove_spaces_between_adjacent_digits.py
normalization/steps/text/remove_trailing_apostrophe_space.py
tests/unit/languages/english_voxpopuli_normalization_test.py
tests/unit/steps/text/remove_trailing_apostrophe_space_test.py

📝 Walkthrough

Adds several English normalization fixes targeting the VoxPopuli parliamentary corpus: possessive-token cleanup in `RemoveTrailingApostropheSpaceStep`, a tightened digit-space regex, parliamentary rule-to-article rewriting and percent-of insertion in operators, new alias replacements, and hundred-compound state tracking in `EnglishNumberNormalizer`. A new test module covers all of these cases.

Changes

English VoxPopuli Normalization

| Layer / File(s) | Summary |
|---|---|
| Possessive token cleanup step and docs: `normalization/steps/text/remove_trailing_apostrophe_space.py`, `docs/steps.md` | `RemoveTrailingApostropheSpaceStep.__call__` applies a second regex to collapse orphan `" s"` possessive tokens to the base word; docstring and docs describe both behaviors. |
| Digit-space regex tightening: `normalization/steps/text/remove_spaces_between_adjacent_digits.py` | Adds a `(?!\s+[a-z]\b)` negative lookahead to the compiled regex, preventing digit-space collapse when the digits are followed by a single lowercase letter token. |
| Parliamentary reference and percent-of normalization: `normalization/languages/english/operators.py` | `fix_one_word_in_numeric_contexts` now rewrites `rule <n>` to `article <n>`, loops to rejoin split subsection suffixes, and inserts `percent` when a digit is followed by `of` without another digit. |
| VoxPopuli alias replacements: `normalization/languages/english/replacements.py` | Maps `pttering`/`puttering`/`putttering`/`puttrich` → `pottering` and `guantnamo` → `guantanamo`. |
| Hundred-compound number parsing: `normalization/languages/english/number_normalizer.py` | Introduces `after_hundred`/`compound_after_hundred` state flags and a `reset_number_phrase_state()` helper; updates the zeros, tens, and multipliers branches to produce thousands-scale values for "hundred … zero" compounds. |
| VoxPopuli normalization tests: `tests/unit/languages/english_voxpopuli_normalization_test.py` | New parametrized tests cover alias correction, percent-of patterns, parliamentary references, possessive cleanup, and `EnglishNumberNormalizer` hundred-compound outputs. |
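
To make the digit-space row concrete, here is a minimal sketch of how a lookahead like `(?!\s+[a-z]\b)` could gate digit joining. Only that lookahead fragment comes from the walkthrough; the surrounding pattern, the name `DIGIT_GAP`, and the function `join_adjacent_digits` are assumptions for illustration and may differ from the actual step.

```python
import re

# Hypothetical base pattern: whitespace sitting between two digit runs.
# Only the `(?!\s+[a-z]\b)` fragment is quoted from the walkthrough; where
# the real step places it inside its regex is not shown in this PR page.
DIGIT_GAP = re.compile(r"(?<=\d)\s+(?=\d+\b(?!\s+[a-z]\b))")


def join_adjacent_digits(text: str) -> str:
    """Drop whitespace between digit groups unless a lone letter follows."""
    return DIGIT_GAP.sub("", text)


print(join_adjacent_digits("1 2 3"))
print(join_adjacent_digits("10 20 b side"))
print(join_adjacent_digits("10 20 bytes"))
```

In this sketch the lookahead only blocks the join when the next token is a single lowercase letter (`b`), not a longer word (`bytes`), because `[a-z]\b` requires a word boundary right after one letter.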

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

lrossillon-gladia
karamouche

Poem

🐇 Hoppity-hop through the corpus we go,
Fixing possessives and hundreds that grow,
"Latvia s" becomes "latvia" with flair,
"rule 42" → "article"—parliament's there!
Percent signs and pottering names fall in line,
The rabbit declares: this normalization's just fine! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the English normalization improvements for VoxPopuli results. |
| Description check | ✅ Passed | The description matches the template and includes the change summary, type, checklist items, and testing section. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/unit/languages/english_voxpopuli_normalization_test.py (1)
26-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a positive digit-of percentage regression.

The stack context calls out a new digit-of rewrite, but this table only exercises %, a spelled-out-number positive case, and a digit negative case. A bug in the literal-digit branch would still pass here. Please add something like ("15 of latvia population", "15 percent of latvia population").
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/languages/english_voxpopuli_normalization_test.py` around lines 26
- 38, The percent normalization test coverage in test_percent_of_patterns is
missing a positive digit-`of` case, so a bug in the literal-digit rewrite could
slip through. Extend the parametrized cases in this test to include a
digit-based input like “15 of latvia population” with the expected “15 percent
of latvia population”, so the pipeline.normalize behavior is verified for the
digit-`of` branch as well.
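
The nitpick above can be illustrated with a small table-driven check. This is a sketch under assumptions: `stand_in_normalize` is a hypothetical substitute for the real `pipeline.normalize`, and `failing_cases` is an illustrative helper, not part of the repository's test suite.

```python
import re
from typing import Callable


def stand_in_normalize(text: str) -> str:
    """Hypothetical stand-in for pipeline.normalize: add "percent" when a
    digit run is followed by "of" and no second number."""
    return re.sub(r"\b(\d+) of\b(?! \d)", r"\1 percent of", text)


CASES = [
    # Positive digit-of case suggested in the review comment.
    ("15 of latvia population", "15 percent of latvia population"),
    # Digit negative case: a second number follows "of", so nothing changes.
    ("3 of 5 votes", "3 of 5 votes"),
]


def failing_cases(normalize: Callable[[str], str], cases=CASES):
    """Return (input, expected, actual) for every case that does not match."""
    return [(src, want, normalize(src)) for src, want in cases if normalize(src) != want]


print(failing_cases(stand_in_normalize))
print(failing_cases(lambda text: text))
```

A normalizer whose digit branch is broken (simulated here by the identity function) passes the negative case but fails the positive one, which is why the extra row is worth adding.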

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@normalization/languages/english/operators.py`:
- Line 195: The percent-insertion regex in operators.py is too broad and
rewrites legitimate “<number> of …” phrases, including cases that should remain
as the prior rule/article normalization handled them. Tighten the matching logic
around the existing text rewrite near the percent conversion so it only fires in
the intended spoken-percentage contexts (for example, by constraining it with
surrounding words or possessive/proper-noun patterns), and ensure the
`rule`→`article` replacement and the `re.sub(...)` percent conversion in the
same normalization flow no longer conflict on phrases like “rule 142 of the
agenda” or “5 of the members.”
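
One possible way to keep the percent rewrite from firing on article references is sketched below. The pattern, the name `PERCENT_OF`, and the function `insert_percent` are assumptions for illustration; the actual fix in `operators.py` is not shown on this page.

```python
import re

# Hypothetical guard: skip digit runs that directly follow "rule" or
# "article", and keep the existing "no second number after of" condition.
PERCENT_OF = re.compile(r"(?<!\brule )(?<!\barticle )\b(\d+) of\b(?! \d)")


def insert_percent(text: str) -> str:
    """Insert "percent" after a bare number followed by "of"."""
    return PERCENT_OF.sub(r"\1 percent of", text)


print(insert_percent("15 of latvia population"))
print(insert_percent("article 142 of the agenda"))
print(insert_percent("rule 142 of the agenda"))
```

This guard covers only the rule/article conflict. A phrase like "5 of the members" would still be rewritten, since telling a count from an elided percentage needs context that a single regex does not have.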

In `@normalization/languages/english/replacements.py`:
- Around line 1746-1751: The VoxPopuli/parliamentary aliases in
ENGLISH_REPLACEMENTS are too broad and are rewriting valid English tokens like
puttering in all normalization paths. Move these entries out of the shared
replacements map and into a VoxPopuli-specific preset or normalization step,
keeping the general English replacement map free of corpus-specific aliases; use
the replacements module and the ENGLISH_REPLACEMENTS symbol to locate the shared
mapping.
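
A minimal sketch of the separation this comment asks for, assuming a token-level replacement step. `VOXPOPULI_ALIASES` and `apply_replacements` are hypothetical names, and the single `ENGLISH_REPLACEMENTS` entry is a placeholder rather than the real map's content; the alias pairs themselves are the ones listed in the walkthrough.

```python
# Placeholder for the shared map (the real ENGLISH_REPLACEMENTS is not shown here).
ENGLISH_REPLACEMENTS = {"ok": "okay"}

# Hypothetical corpus-specific map holding the aliases from this PR.
VOXPOPULI_ALIASES = {
    "pttering": "pottering",
    "puttering": "pottering",
    "putttering": "pottering",
    "puttrich": "pottering",
    "guantnamo": "guantanamo",
}


def apply_replacements(text: str, *maps: dict[str, str]) -> str:
    """Replace whole tokens using the given maps; later maps win on conflicts."""
    merged: dict[str, str] = {}
    for mapping in maps:
        merged.update(mapping)
    return " ".join(merged.get(token, token) for token in text.split())


# Shared path: the valid English word "puttering" is left alone.
print(apply_replacements("he was puttering around", ENGLISH_REPLACEMENTS))
# VoxPopuli path: the alias map is opted into explicitly.
print(apply_replacements("president puttering spoke", ENGLISH_REPLACEMENTS, VOXPOPULI_ALIASES))
```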

In `@normalization/steps/text/remove_trailing_apostrophe_space.py`:
- Around line 19-20: The current remove_trailing_apostrophe_space text
normalization is over-aggressive because it collapses any 2+ letter "<word> s"
pattern, not just possessives. Update remove_trailing_apostrophe_space.py so the
trailing-s handling happens earlier in the apostrophe-stripping flow, or
restrict it to a narrower corpus-specific transform instead of the shared
text_post substitution; use the existing remove_trailing_apostrophe_space
function as the place to remove the global re.sub on "\b([a-z]{2,}) s\b" and
preserve real tokens like "letter s" and "model s".
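
The over-match described here is easy to reproduce. In the sketch below, `BROAD` is the pattern quoted in the comment, while `NARROW` is a hypothetical alternative that only fires while the spaced apostrophe is still present; it is an assumption for illustration, not the fix that was committed.

```python
import re

BROAD = re.compile(r"\b([a-z]{2,}) s\b")    # pattern quoted in the review comment
NARROW = re.compile(r"\b([a-z]{2,})' s\b")  # hypothetical: requires the apostrophe


def collapse(pattern: re.Pattern, text: str) -> str:
    """Drop the matched possessive suffix, keeping only the base word."""
    return pattern.sub(r"\1", text)


print(collapse(BROAD, "latvia s population"))    # intended possessive cleanup
print(collapse(BROAD, "tesla model s"))          # real token "s" is lost
print(collapse(NARROW, "tesla model s"))         # left untouched
print(collapse(NARROW, "latvia' s population"))  # still cleaned up
```

The narrow variant only works if it runs before apostrophes are stripped, which matches the comment's suggestion to move the handling earlier in the flow.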


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47c2e6b0-ea95-4762-b791-a1d1a813bc60

📥 Commits

Reviewing files that changed from the base of the PR and between d3a8062 and 387f96a.

📒 Files selected for processing (7)

docs/steps.md
normalization/languages/english/number_normalizer.py
normalization/languages/english/operators.py
normalization/languages/english/replacements.py
normalization/steps/text/remove_spaces_between_adjacent_digits.py
normalization/steps/text/remove_trailing_apostrophe_space.py
tests/unit/languages/english_voxpopuli_normalization_test.py

fix(en): improve normalization after voxpopuli results

387f96a

Align ref/hyp WER gaps for corpus aliases, possessives, article refs, elided percentages, and hundred-scale numbers misheard as trailing zero.

egenthon-cmd requested a review from karamouche June 30, 2026 17:15

coderabbitai Bot reviewed Jun 30, 2026

Comment thread normalization/languages/english/operators.py Outdated

Comment thread normalization/languages/english/replacements.py

Comment thread normalization/steps/text/remove_trailing_apostrophe_space.py Outdated

fix: feedback from coderabbit

3d97f82

egenthon-cmd wants to merge 2 commits into main from feat/update-english-normalization
